Recap
Interpretation
Interpreting simple scenarios:
\[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{\beta}\]
This is the matrix formula for least squares regression.
If \(X\) is a column vector of \(1\)s, \(\beta\) is just the mean of \(Y\). (We just did this)
If \(X\) is a column of \(1\)s and a column of \(x\)s, it is bivariate regression.
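Both special cases are easy to check numerically. This sketch uses simulated data (the variable names are illustrative, not from the original):

```r
set.seed(1)
y = rnorm(100, mean = 5)
x = rnorm(100)

# X as a column of 1s: beta is just the mean of y
X1 = matrix(1, nrow = 100)
b1 = solve(t(X1) %*% X1) %*% t(X1) %*% y
all.equal(b1[1, 1], mean(y))  # TRUE

# X as a column of 1s and a column of xs: the slope is Cov(x, y) / Var(x)
X2 = cbind(1, x)
b2 = solve(t(X2) %*% X2) %*% t(X2) %*% y
all.equal(b2[2, 1], cov(x, y) / var(x))  # TRUE
```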
We can now add \(p > 2\): more variables
\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]
Matrix equation fits mean of \(Y\) as a linear function of many variables:
\[Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki}\]
The mathematical procedures we use in regression ensure that:
\(1\). The mean of the residuals is always zero: \(\overline{e} = 0\). Because we included an intercept (\(b_0\)), the regression line goes through the point of averages. This is also true of residuals from the mean.
\(2\). \(Cov(X, e) = 0\). This holds by construction: it follows from the least squares normal equations.
\(3\). \(\overline{Y} = \overline{X}'\beta\): the predicted value \(\hat{Y}\) when all variables in \(X\) are at their mean is the mean of \(Y\).
\(4\). Slope \(b_k\) captures variation in \(Y\) due to variation in \(X_k\) that is orthogonal/uncorrelated to all other columns of \(X\).
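Properties 1–3 are mechanical, so any simulated data will confirm them (the names below are illustrative):

```r
set.seed(2)
# Simulated data; the properties hold for any data, by construction
x = rnorm(200)
y = 1 + 2 * x + rnorm(200)
fit = lm(y ~ x)
e = residuals(fit)
b = coef(fit)

mean(e)                # property 1: zero, up to floating-point error
cov(x, e)              # property 2: zero by construction
b[1] + b[2] * mean(x)  # property 3: the prediction at mean(x) equals mean(y)
```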
\[Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]
\(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)
where \(X_k^* = X_k - \hat{X}_k\), with \(\hat{X}_k\) obtained from the regression of \(X_k\) on all other columns of \(\mathbf{X}\):
\(X_k = c_0 + c_1 X_1 + \ldots + c_j X_j, \quad j \neq k\)
\(X_k^*\) is the residual from regressing \(X_k\) on all other \(X_{j \neq k}\)
How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\), where \(X_k^*\) is the residual of \(X_k\) after regressing it on all other \(X_{j \neq k}\)?
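This "partialling out" identity (the Frisch–Waugh result) can be sketched with simulated data (all names here are illustrative):

```r
set.seed(3)
n = 500
x1 = rnorm(n)
x2 = 0.5 * x1 + rnorm(n)            # x2 is correlated with x1
y = 1 + 2 * x1 - 1 * x2 + rnorm(n)

# Coefficient on x2 from the full multivariate regression
b_full = coef(lm(y ~ x1 + x2))["x2"]

# Residualize x2 on all other columns of X (here, just x1 and the intercept)
x2_star = residuals(lm(x2 ~ x1))

# The same coefficient from Cov(x2*, y) / Var(x2*)
b_fwl = cov(x2_star, y) / var(x2_star)
all.equal(unname(b_full), b_fwl)  # TRUE
```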
How does multivariate regression relate back to Conditional Expectation Function?
The least squares prediction \(\widehat{Y_i} = X_i'\beta\) is the linear approximation of \(E(Y_i | X_i)\) with the smallest mean-squared error (the linear approximation with minimum distance to the truth).
Exercise: Download the data grouped by AGE in years
if (!require(httr)) install.packages('httr')
r = GET("http://mdweaver.github.io/poli572b/lecture_7/example_7.csv")
cef = as.data.frame(content(r))
Exercise:
Use matrix calculations in R to obtain linear approximation of \(E(Y_i | X_i)\):
#With individual data...
X = cbind(1, acs_data$AGE)
y = acs_data$INCEARN
beta = solve(t(X) %*% X) %*% t(X) %*% y
beta
## [,1]
## [1,] 23404.561
## [2,] 3535.247
Do your estimates agree? Why or why not?
We are not accounting for the number of people at each value of \(X_i\).
#With aggregate data...
X = cbind(1, cef$AGE)
y = cef$INCEARN
w = cef$respondents
W = diag(w) # diagonal matrix with group sizes on the diagonal
beta = solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% y
beta
## [,1]
## [1,] 23404.561
## [2,] 3535.247
We can reproduce any individual-level least squares result from grouped data by weighting each group mean by the number of observations in that group.
We are directly modelling the conditional expectation \(E(Y_i | X_i)\) as a linear function.
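The same equivalence can be checked end-to-end with lm() and its weights argument. The data below are simulated stand-ins for acs_data and cef, not the original survey data:

```r
set.seed(4)
# Simulated individual-level data (a stand-in for acs_data)
df = data.frame(age = sample(25:60, 2000, replace = TRUE))
df$income = 20000 + 3500 * df$age + rnorm(2000, sd = 10000)

# Individual-level least squares
b_ind = coef(lm(income ~ age, data = df))

# Group means of income by age, weighted by group size
cef = aggregate(income ~ age, data = df, FUN = mean)
cef$n = as.vector(table(df$age))
b_grp = coef(lm(income ~ age, data = cef, weights = n))

all.equal(b_ind, b_grp)  # TRUE: the weighted group-level fit reproduces the individual fit
```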
Dummy variables: what are they?
Binary variables taking \(1\) if the observation has some attribute, \(0\) otherwise.
Can be used for variables with many categories (e.g. race, marital status, etc.)
How do we make them?
In R, use as.factor(): each unique value will get a dummy.
lm(y ~ as.factor(any_variable))
m1 = lm(INCEARN ~ FEMALE, acs_data)
m2 = lm(INCEARN ~ FEMALE + MALE, acs_data)
m3 = lm(INCEARN ~ -1 + FEMALE + MALE, acs_data)
#Design matrix
model.matrix(m1) %>% head
## (Intercept) FEMALE
## 1 1 1
## 2 1 0
## 3 1 0
## 4 1 1
## 5 1 0
## 6 1 0
#Interpret:
coefficients(m1)
## (Intercept) FEMALE
## 203342.42 -61756.51
#Design matrix
model.matrix(m2) %>% head
## (Intercept) FEMALE MALE
## 1 1 1 0
## 2 1 0 1
## 3 1 0 1
## 4 1 1 0
## 5 1 0 1
## 6 1 0 1
#Interpret:
coefficients(m2)
## (Intercept) FEMALE MALE
## 203342.42 -61756.51 NA
#Design matrix
model.matrix(m3) %>% head
## FEMALE MALE
## 1 1 0
## 2 0 1
## 3 0 1
## 4 1 0
## 5 0 1
## 6 0 1
#Interpret:
coefficients(m3)
## FEMALE MALE
## 141585.9 203342.4
What do dummy variables do?
If there is an intercept, one category must be dropped as the reference group: each dummy coefficient is the difference between that group's mean and the reference group's mean (R will do this by default).
If there is no intercept, every category gets a dummy, and each coefficient is that group's mean of \(Y\).
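Both parameterizations can be checked on simulated data (female and income here are illustrative stand-ins for the FEMALE dummy and INCEARN in acs_data):

```r
set.seed(5)
female = rbinom(300, 1, 0.5)
income = ifelse(female == 1, 140000, 200000) + rnorm(300, sd = 5000)

m_int   = lm(income ~ female)                       # with intercept
m_noint = lm(income ~ -1 + female + I(1 - female))  # no intercept, one dummy per group

coef(m_int)    # intercept = mean where female == 0; slope = difference in group means
coef(m_noint)  # each coefficient = that group's mean
all.equal(unname(coef(m_int)[1]), mean(income[female == 0]))  # TRUE
```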
How did different regions in the UK vote on Brexit?
Download the data:
if (!require(httr)) install.packages('httr')
## Loading required package: httr
r = GET("http://mdweaver.github.io/poli572b/lecture_7/analysisData.csv")
brexit = as.data.table(content(r))
## Parsed with column specification:
## cols(
## .default = col_double(),
## Regn_Cd = col_character(),
## Region = col_character(),
## Area_Cd = col_character(),
## Area = col_character()
## )
## See spec(...) for full column specifications.
table(brexit$Region)
##
## East East Midlands London
## 47 40 33
## North East Northern Ireland North West
## 12 1 39
## Scotland South East South West
## 32 67 37
## Wales West Midlands Yorkshire and The Humber
## 22 30 21
hist(brexit$Pct_Lev)
Using the lm function:
Regress Pct_Lev on Region (inspect the results with summary(your_model)).
Regress Pct_Lev on -1 + Region.
Comparing your_model$fitted.values, what can you say about the \(\widehat{Y}\) produced by these two regressions?
##
## Call:
## lm(formula = Pct_Lev ~ RegionU, data = brexit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.8121 -4.7843 0.4257 5.1457 30.5685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.13687 1.39400 28.075 < 2e-16 ***
## RegionULondon -0.04536 1.95643 -0.023 0.982
## RegionUNorth West 16.77825 1.88088 8.920 < 2e-16 ***
## RegionUYorkshire and The Humber 19.51312 2.21458 8.811 < 2e-16 ***
## RegionUNorth East 20.34229 2.66931 7.621 2.15e-13 ***
## RegionUWest Midlands 21.17779 2.00400 10.568 < 2e-16 ***
## RegionUEast Midlands 20.43762 1.87025 10.928 < 2e-16 ***
## RegionUSouth West 14.54745 1.90365 7.642 1.87e-13 ***
## RegionUEast 17.82525 1.80729 9.863 < 2e-16 ***
## RegionUSouth East 13.03312 1.69451 7.691 1.34e-13 ***
## RegionUWales 14.21085 2.18398 6.507 2.50e-10 ***
## RegionUNorthern Ireland 5.08312 8.00793 0.635 0.526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.886 on 369 degrees of freedom
## Multiple R-squared: 0.443, Adjusted R-squared: 0.4264
## F-statistic: 26.68 on 11 and 369 DF, p-value: < 2.2e-16
How is this different from what you produced?
During the Brexit vote, many areas thought to support “Remain” experienced heavy rainfall. total_precip_prepolling is the number of mm of rain that fell just before polls were open.
Regress Pct_Lev on total_precip_prepolling + Region.
How is the slope on total_precip_prepolling calculated in this case? (think of \(X_k^*\))
What does the coefficient on total_precip_prepolling mean?
##
## Call:
## lm(formula = Pct_Lev ~ total_precip_prepolling + Region, data = brexit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.7877 -4.7351 0.4749 4.7649 28.7536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.55803 1.49590 35.135 < 2e-16 ***
## total_precip_prepolling 0.05186 0.01165 4.451 1.13e-05 ***
## RegionEast Midlands 7.01178 1.92738 3.638 0.000314 ***
## RegionLondon -19.26625 1.77485 -10.855 < 2e-16 ***
## RegionNorth East 6.92113 2.67736 2.585 0.010120 *
## RegionNorthern Ireland -8.33820 7.83610 -1.064 0.287993
## RegionNorth West 3.35710 1.93773 1.732 0.084024 .
## RegionScotland -13.42789 2.02081 -6.645 1.10e-10 ***
## RegionSouth East -3.11209 1.51142 -2.059 0.040193 *
## RegionSouth West 1.11001 1.95693 0.567 0.570911
## RegionWales 0.78957 2.21970 0.356 0.722260
## RegionWest Midlands 7.75486 2.05162 3.780 0.000183 ***
## RegionYorkshire and The Humber 6.09083 2.24826 2.709 0.007061 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.692 on 368 degrees of freedom
## Multiple R-squared: 0.4715, Adjusted R-squared: 0.4542
## F-statistic: 27.35 on 12 and 368 DF, p-value: < 2.2e-16
You can change the “reference group” like this. This code shifts “Scotland” to the reference by making it the first “level” in the factor.
levels = c('Scotland', 'London', 'North West', 'Yorkshire and The Humber', 'North East', 'West Midlands', 'East Midlands', 'South West', 'East', 'South East', 'Wales', 'Northern Ireland')
brexit[, RegionU := factor(Region, levels = levels)]
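An equivalent one-step alternative (not shown in the original) is base R's relevel(), which moves a chosen level to the front of an existing factor, making it the reference group in lm():

```r
# Toy factor: relevel() makes "Scotland" the first level
region = factor(c("London", "Scotland", "Wales", "London"))
region2 = relevel(region, ref = "Scotland")
levels(region2)  # "Scotland" "London" "Wales"
```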
Let’s say we want to get the conditional expectation function for yearly earnings of professionals as a function of hours worked, profession, gender, and age.
Using data from the American Community Survey, we can look at how hours worked is related to earnings for doctors and lawyers.
UHRSWORK (usual hours worked per week)
FEMALE (an indicator for female)
AGE (in years)
LAW (an indicator for being a lawyer)
INCEARN (total annual income earned, in dollars)
Let's now find the (approximate) linear conditional expectation function of earnings across hours worked, age, profession, and gender:
ls = lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, acs_data)
##
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -291485 -89191 -28271 71281 1067525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9929.8 9370.4 1.06 0.289
## UHRSWORK 1245.4 111.9 11.13 <2e-16 ***
## AGE 3311.3 125.0 26.50 <2e-16 ***
## LAW -49672.0 2647.1 -18.76 <2e-16 ***
## FEMALE -42281.4 2811.5 -15.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 127300 on 9995 degrees of freedom
## Multiple R-squared: 0.1502, Adjusted R-squared: 0.1498
## F-statistic: 441.6 on 4 and 9995 DF, p-value: < 2.2e-16
How do we make sense of, e.g., the slope on FEMALE?
In your small groups:
Could we do better at approximating the conditional expectation function, e.g., of earnings by age?
How would you do it?
Try it out
How would you apply this to the example above: ls = lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, acs_data)?
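One common answer is to saturate in age: replace the linear AGE term with one dummy per age value, e.g. lm(INCEARN ~ UHRSWORK + as.factor(AGE) + LAW + FEMALE, acs_data). A sketch on simulated data (a stand-in for acs_data) shows why: with age as the only regressor, the saturated model fits every conditional mean exactly.

```r
set.seed(6)
# Earnings deliberately nonlinear in age
age = sample(25:60, 2000, replace = TRUE)
incearn = 20000 + 4000 * age - 40 * (age - 42)^2 + rnorm(2000, sd = 15000)

fit_lin = lm(incearn ~ age)             # linear approximation of the CEF
fit_sat = lm(incearn ~ as.factor(age))  # one dummy per age value

# The saturated fit recovers E(Y | age) exactly: fitted values are group means
i = which(age == 30)[1]
all.equal(unname(fitted(fit_sat)[i]), mean(incearn[age == 30]))  # TRUE
```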